========================================================

Context: The H-1B is an employment-based, non-immigrant visa category for temporary foreign workers in the United States. For a foreign national to apply for an H-1B visa, a US employer must offer a job and file an H-1B petition with the US immigration authorities. This is the most common visa status applied for and held by international students once they complete college or higher education (Master’s, PhD) and work in a full-time position.

Data Overview

The columns in the dataset include:

## Observations: 3,002,458
## Variables: 11
## $ X                  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...
## $ CASE_STATUS        <fctr> CERTIFIED-WITHDRAWN, CERTIFIED-WITHDRAWN, ...
## $ EMPLOYER_NAME      <fctr> UNIVERSITY OF MICHIGAN, GOODMAN NETWORKS, ...
## $ SOC_NAME           <fctr> BIOCHEMISTS AND BIOPHYSICISTS, CHIEF EXECU...
## $ JOB_TITLE          <fctr> POSTDOCTORAL RESEARCH FELLOW, CHIEF OPERAT...
## $ FULL_TIME_POSITION <fctr> N, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ PREVAILING_WAGE    <dbl> 36067.0, 242674.0, 193066.0, 220314.0, 1575...
## $ YEAR               <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2...
## $ WORKSITE           <fctr> ANN ARBOR, MICHIGAN, PLANO, TEXAS, JERSEY ...
## $ lon                <dbl> -83.74304, -96.69889, -74.07764, -104.99025...
## $ lat                <dbl> 42.28083, 33.01984, 40.72816, 39.73924, 38....

The WORKSITE variable contains both city and state information, so it’s more convenient for me to do more granular analysis if I split WORKSITE into STATE and CITY.
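The split can be sketched in a few lines of Python. This is only an illustration, not the code used to produce the output in this report. It assumes every WORKSITE value has the form "CITY, STATE", and the helper name split_worksite is hypothetical. Splitting on the last comma keeps a city name intact even if it contains a comma itself.

```python
def split_worksite(worksite):
    """Split a 'CITY, STATE' string on its last comma and return (city, state)."""
    city, sep, state = worksite.rpartition(",")
    if not sep:
        # No comma found: treat the whole value as the city, state unknown.
        return worksite.strip(), None
    return city.strip(), state.strip()


# Sample values taken from the WORKSITE column preview above.
for value in ["ANN ARBOR, MICHIGAN", "PLANO, TEXAS", "JERSEY CITY, NEW JERSEY"]:
    print(split_worksite(value))
```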

## Observations: 3,002,458
## Variables: 12
## $ X                  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...
## $ CASE_STATUS        <fctr> CERTIFIED-WITHDRAWN, CERTIFIED-WITHDRAWN, ...
## $ EMPLOYER_NAME      <fctr> UNIVERSITY OF MICHIGAN, GOODMAN NETWORKS, ...
## $ SOC_NAME           <fctr> BIOCHEMISTS AND BIOPHYSICISTS, CHIEF EXECU...
## $ JOB_TITLE          <fctr> POSTDOCTORAL RESEARCH FELLOW, CHIEF OPERAT...
## $ FULL_TIME_POSITION <fctr> N, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ PREVAILING_WAGE    <dbl> 36067.0, 242674.0, 193066.0, 220314.0, 1575...
## $ YEAR               <ord> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2...
## $ CITY               <chr> "ANN ARBOR", "PLANO", "JERSEY CITY", "DENVE...
## $ STATE              <ord> MICHIGAN, TEXAS, NEW JERSEY, COLORADO, MISS...
## $ lon                <dbl> -83.74304, -96.69889, -74.07764, -104.99025...
## $ lat                <dbl> 42.28083, 33.01984, 40.72816, 39.73924, 38....

Univariate Plots Section

H1B Visa Case Status

The bar chart of case status shows that certified cases dominate this dataset: the vast majority of cases have the status “Certified”. My further analysis therefore relies only on CERTIFIED cases, which keeps the comparisons consistent.

Full-time jobs vs. non-full-time jobs

Next, let’s take a look at the percentages of full-time jobs and non full-time jobs. 85.8% of the total H1B visa applicants have full-time jobs, while the remaining 14.2% filed H1B visa petitions based on their part-time jobs.

Distribution of Prevailing Wage

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     10500     54770     65120     72550     81430 306000000

To investigate the prevailing_wage variable, a histogram is the natural way to see the distribution. However, there are more than 3 million records in the dataset, with a lot of extreme values. An alternative is to randomly sample about one tenth of the records and exclude the bottom 10 percent and top 5 percent of data points from the sampled dataset before plotting.
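A minimal Python sketch of this sample-then-trim step follows. It is an illustration under stated assumptions, not the code behind the plot: the function name sample_and_trim, the fixed seed and the synthetic wage list are all hypothetical, and the percentile cut points are computed on the sample rather than on the full dataset.

```python
import random
import statistics


def sample_and_trim(wages, frac=0.1, lower_pct=10, upper_pct=95, seed=42):
    """Randomly sample a fraction of wages, then keep only the values lying
    between the lower and upper percentiles of that sample."""
    rng = random.Random(seed)
    k = max(2, int(len(wages) * frac))
    sample = rng.sample(wages, k)
    # quantiles(n=100) returns the 99 percentile cut points of the sample.
    cuts = statistics.quantiles(sample, n=100)
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [w for w in sample if low <= w <= high]


# Synthetic wages (not the real data), with one extreme value mixed in.
wages = [40000 + 10 * i for i in range(10000)] + [306000000]
trimmed = sample_and_trim(wages)
print(len(trimmed), min(trimmed), max(trimmed))
```

Trimming after sampling means the cut points vary slightly from one random sample to the next, so fixing the seed keeps the histogram reproducible.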

The resulting histogram of prevailing wages is right-skewed, with the most common wages falling between $60,000 and $65,000. The right tail shows that the number of foreign workers falls as wages increase.

The biggest flaw in this histogram is that I didn’t adjust the wages for inflation: the chart pools all the data from 2011 to 2016.

H1B petitions by state

California had the most H1B visa petitions from 2011 to 2016, followed by Texas and New York. Later we will dive deeper into H1B visa petitions by city.


Univariate Analysis

What is the structure of your dataset?

There are 3,002,458 records in the dataset with 11 variables: a row index (X) and 10 features (case_status, employer_name, soc_name, job_title, full_time_position, prevailing_wage, year, worksite, longitude, latitude).

There are some interesting observations derived from the univariate analysis:

  • The case status of the majority in this dataset is “Certified”.

  • The top 3 employers that filed the most H1B visa petitions for their foreign employees are all Indian IT companies.

  • The median prevailing wage for an H1B worker is $65,020 and the maximum is $6,998,000,000. The interquartile range of prevailing wage is $27,061.

  • California, New York and Texas had the most H1B petitions from 2011 to 2016.

What is/are the main feature(s) of interest in your dataset?

The main features of interest in the dataset are prevailing wage and quantity of H1B petitions. I would like to see how the prevailing wage varies across different employers and occupations. I wonder whether some employers tend to give a more generous salary package to H1B workers than others, and whether particular occupations receive a higher salary than others. I’m also curious about changes in wages and quantity of H1B petitions from 2011 to 2016.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

SOC_NAME and EMPLOYER_NAME. These two variables let me compare prevailing wages across occupations and employers.

STATE, longitude and latitude are likely to contribute to the different wage levels and quantity of H1B petitions. My question is whether H1B workers living in areas with high living expenses tend to get higher wages correspondingly compared to those in other areas, and whether some economically developed areas are more willing to hire more foreign workers.

Did you create any new variables from existing variables in the dataset?

Yes, I split the WORKSITE variable into two new variables, CITY and STATE. With this data transformation, I can easily conduct more granular analysis based on locations.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Prevailing Wage:

When I investigated the distribution of prevailing wage, I found many extreme values that distort the shape of the histogram. Digging deeper into the data, I noticed that the minimum wage is $0 and the maximum wage exceeds 6 billion dollars! These anomalies severely affect the analysis, so I decided to discard the extreme values and focus on the middle range of the wage data.

Case Status:

At first glance, it’s hard to believe that there were so many H1B visa petitions during the past six years, because only 85,000 cap-subject H1B visas can be issued each year. Of these, 65,000 are available under the Regular Cap and 20,000 under the ADE (Advanced Degree Exemption) Cap.

After extensive research, I found the answer on the Kaggle discussion forum: the data contains new H1B petitions (before the lottery), extension petitions, and positions exempt from the H-1B visa cap (PhDs, researchers). For CASE_STATUS, “CERTIFIED” does not mean the applicant got his/her H1B visa approved; it just means that he/she is eligible to file an H1B petition.


Bivariate Plots Section

Trend of H1B petitions over years

Comparison of H1B petitions between California, New York and Texas

YEAR   CALIFORNIA   NEW YORK    TEXAS
2011        56252      35244    26851
2012        64537      37086    31841
2013        72171      36460    36408
2014        85164      42169    45091
2015       100710      47703    55066
2016       104070      51293    59694
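As a quick check on this table, the short Python sketch below computes the six-year total for each state and the first year in which Texas overtakes New York. The counts are copied from the table above; the variable names are illustrative only.

```python
# Petition counts per year, copied from the table above.
petitions = {
    2011: {"CALIFORNIA": 56252, "NEW YORK": 35244, "TEXAS": 26851},
    2012: {"CALIFORNIA": 64537, "NEW YORK": 37086, "TEXAS": 31841},
    2013: {"CALIFORNIA": 72171, "NEW YORK": 36460, "TEXAS": 36408},
    2014: {"CALIFORNIA": 85164, "NEW YORK": 42169, "TEXAS": 45091},
    2015: {"CALIFORNIA": 100710, "NEW YORK": 47703, "TEXAS": 55066},
    2016: {"CALIFORNIA": 104070, "NEW YORK": 51293, "TEXAS": 59694},
}

# Six-year total per state.
totals = {
    state: sum(counts[state] for counts in petitions.values())
    for state in ("CALIFORNIA", "NEW YORK", "TEXAS")
}

# First year in which Texas has more petitions than New York.
first_year_texas_leads = min(
    year for year, counts in petitions.items()
    if counts["TEXAS"] > counts["NEW YORK"]
)

print(totals)
print(first_year_texas_leads)
```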

Prevailing wages of top 10 employers

From the boxplot we can see that the median wage at Microsoft exceeds that of all the other major sponsor companies, given the wage range between $0 and $15,000. The interquartile range of prevailing wages at Tata Consultancy is the smallest among these companies; in other words, Tata Consultancy has the least variation in wages for the middle 50% of H1B workers.

Top 10 occupations with highest median prevailing wages

SOC_NAME                                             median_wage
PHYSICIANS AND SURGEONS, ALL OTHTER                     230605.0
UROLOGISTS                                              213158.0
STRUCTURAL METAL FABRICATORS AND FITTERS                204090.0
SECURITIES, COMMODITIES, AND FINANCIAL SERVICES S       201510.0
FAMILY & GENERAL PRACTITIONERS                          194188.8
PEDIATRICIANS                                           189731.0
INTERNISTS                                              187200.0
DENTIST                                                 182874.0
PHYSICIAN AND SURGEONS, ALL OTHER                       178235.2
PHYSICIANS AND SURGEONS                                 174179.0

Because the dataset has a lot of outliers and is severely skewed, using the median wage to compare prevailing wages across occupations reduces distortion and gives a better picture. By this measure, the highest-paid occupation is a physicians and surgeons category. Of the top 10 high-income occupations, 8 are in the medical and health care field. I also found that none of the top 10 high-income occupations appears among the top 10 occupations with the most H1B petitions (computer systems analysts, software developers, etc.).

Certified Cases vs. Denied Cases
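The effect of outliers on the two metrics can be illustrated with a small Python sketch. The records below are made up for the illustration (they are not rows from the dataset), and one deliberately absurd wage stands in for the kind of extreme value seen in PREVAILING_WAGE.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (SOC_NAME, wage) records; the 7,000,000 value mimics an outlier.
records = [
    ("SOFTWARE DEVELOPERS", 70000),
    ("SOFTWARE DEVELOPERS", 72000),
    ("SOFTWARE DEVELOPERS", 7000000),
    ("INTERNISTS", 180000),
    ("INTERNISTS", 190000),
    ("INTERNISTS", 195000),
]

# Group wages by occupation.
wages_by_soc = defaultdict(list)
for soc, wage in records:
    wages_by_soc[soc].append(wage)

# Rank occupations from highest to lowest under each metric.
rank_by_median = sorted(wages_by_soc, key=lambda s: median(wages_by_soc[s]), reverse=True)
rank_by_mean = sorted(wages_by_soc, key=lambda s: mean(wages_by_soc[s]), reverse=True)

print(rank_by_median)  # the outlier does not move the median ranking
print(rank_by_mean)    # the single outlier pushes its occupation to the top
```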

In general, the prevailing wages of denied H1B cases have more extreme values than those of certified H1B cases. The 1st quartile, median and 3rd quartile of prevailing wages for certified cases are greater than those of denied cases respectively.

Wage distribution for each year

For each year, the bulk of H1B applicants have salaries between $50,000 and $70,000. The distribution of wage for each year is right skewed.

Trend of wages from 2011 to 2016

YEAR   mean_wage   median_wage   10th_percentile   90th_percentile
2011    65088.39         61173           40123.0             94952
2012    66658.97         62546           42099.0             96138
2013    68426.64         63898           44616.0             98342
2014    69405.36         64688           45406.0             98675
2015    70561.24         66019           47247.2             99939
2016    72778.78         68141           48318.0            104118

This plot depicts the mean, median, 10th percentile and 90th percentile of wages for all the H1B workers in each year. From 2011 to 2016, wages for H1B applicants increased gradually at a relatively low rate. The mean wage is larger than the median wage in every year, which indicates that the wage distribution is right-skewed: the high outliers drag the mean up.
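Both observations can be checked directly from the table with a short Python sketch. The figures are copied from the table above; the compound annual growth rate is one reasonable way to quantify "relatively low rate" and is an added illustration rather than a statistic from the original analysis.

```python
# (mean_wage, median_wage) per year, copied from the table above.
yearly = {
    2011: (65088.39, 61173),
    2012: (66658.97, 62546),
    2013: (68426.64, 63898),
    2014: (69405.36, 64688),
    2015: (70561.24, 66019),
    2016: (72778.78, 68141),
}

# A positive mean-minus-median gap in every year is consistent with right skew.
gaps = {year: round(avg - med, 2) for year, (avg, med) in yearly.items()}

# Compound annual growth rate of the median wage over the whole period.
first, last = min(yearly), max(yearly)
median_growth = (yearly[last][1] / yearly[first][1]) ** (1 / (last - first)) - 1

print(gaps)
print(f"{median_growth:.2%}")
```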


Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  1. Across 2011 to 2016, wages for foreign workers with certified H1B cases are higher on average than those for denied cases. Some denied cases even list a wage of $0.

  2. Microsoft, one of the biggest tech companies, has the highest average wage for employees with certified H1B cases. Big consulting firms such as Accenture and Deloitte also tend to hire many foreign workers and offer them good salaries.

  3. Although certified H1B applicants in the medical science and health care field account for only a small fraction of all H1B applicants, they earn much more than people in other fields. The top physician and surgeon category has a median annual wage above $200,000.

  4. When investigating the trend of wages from 2011 to 2016, I found that the average and median wages for H1B workers went up steadily.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

California, New York and Texas are the top 3 states by number of H1B petitions, so I decided to compare the number of H1B cases in each state per year. The bar plot makes it clear that California outnumbered the other two in every year. In 2011 and 2012, Texas had fewer certified H1B cases than New York; by 2013 Texas had nearly caught up, and it exceeded New York in 2014, 2015 and 2016. The number of certified H1B cases increased from 2011 to 2016 in all three states, apart from a slight dip in New York in 2013.

What was the strongest relationship you found?

Year and quantity of H1B petitions are positively correlated, and the same holds for year and prevailing wage. The US economy grew steadily over these six years, which likely brought more job opportunities to the US job market as well as widespread wage increases.


Multivariate Plots Section

Distribution of H1B applicants in California, Texas and New York

California, Texas and New York are the top 3 states with the most H1B applicants from 2011 to 2016. I plotted the location of each record using its longitude and latitude so that the distribution of H1B applicants is easier to see on the map.

The dots cluster around the San Francisco Bay Area, Los Angeles and San Diego in California. The metropolitan areas in Texas with the most H1B applicants are Dallas-Fort Worth and Houston. In New York state, NYC including Long Island is far ahead of any other city.

Changes in the proportion of computer and mathematical occupations

Next, I plotted a stacked bar chart to investigate how the proportion of H1B applicants with computer or mathematical occupations changed from 2011 to 2016. The share of computer or mathematical occupations among all occupations gradually increased over the years. As the Internet industry has flourished, the demand for workers with computer-related skills has grown year by year.
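The yearly proportion behind such a chart can be sketched in Python as follows. How the original analysis decided which SOC_NAME values count as computer or mathematical is not stated, so the keyword list here is purely an assumption, as are the function names and the sample records.

```python
from collections import Counter

# Assumed keywords for flagging computer/mathematical occupations (hypothetical).
KEYWORDS = ("COMPUTER", "SOFTWARE", "MATHEMATIC", "STATISTICIAN")


def is_computer_math(soc_name):
    """Return True if the occupation name contains any assumed keyword."""
    name = soc_name.upper()
    return any(keyword in name for keyword in KEYWORDS)


def share_by_year(records):
    """Return {year: share of records flagged as computer/mathematical}."""
    total, flagged = Counter(), Counter()
    for year, soc_name in records:
        total[year] += 1
        flagged[year] += int(is_computer_math(soc_name))
    return {year: flagged[year] / total[year] for year in sorted(total)}


# Made-up records for illustration only.
records = [
    (2011, "Computer Systems Analysts"),
    (2011, "ACCOUNTANTS AND AUDITORS"),
    (2016, "SOFTWARE DEVELOPERS, APPLICATIONS"),
    (2016, "COMPUTER PROGRAMMERS"),
    (2016, "INTERNISTS"),
    (2016, "STATISTICIANS"),
]
print(share_by_year(records))
```

Keyword matching is crude; mapping occupations through official SOC codes would be more reliable if the dataset provided them.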

Trend of Wages for Computer or Mathematical Occupations and Other Occupations

There is an obvious increase in the wages for computer and mathematical occupations. The median wage jumped noticeably from 2011 to 2012, and after that we can see a steady increase.
For other occupations, by contrast, the median wage fluctuates with no apparent increase over the years. In addition, the variance in wages for other occupations across the six years is larger than that for computer and mathematical occupations.

Distribution of computer and mathematical occupations in the USA

STATE            CITY            computer_mathematical_occupations
NEW YORK         NEW YORK                                    63053
GEORGIA          ATLANTA                                     35727
TEXAS            HOUSTON                                     35241
CALIFORNIA       SAN FRANCISCO                               34461
CALIFORNIA       SAN JOSE                                    29342
ILLINOIS         CHICAGO                                     25530
CALIFORNIA       SUNNYVALE                                   22837
NORTH CAROLINA   CHARLOTTE                                   22709
TEXAS            IRVING                                      22622
WASHINGTON       REDMOND                                     20391
NEW JERSEY       JERSEY CITY                                 18925
CALIFORNIA       MOUNTAIN VIEW                               18393
TEXAS            DALLAS                                      16501
WASHINGTON       SEATTLE                                     15922
ARIZONA          PHOENIX                                     14879
GEORGIA          ALPHARETTA                                  14764
TEXAS            AUSTIN                                      14677
CALIFORNIA       SANTA CLARA                                 14636
TEXAS            PLANO                                       14138
WASHINGTON       BELLEVUE                                    13745

Not surprisingly, the states that filed the most H1B petitions also have correspondingly large numbers of computer and mathematical positions. This is because computer and mathematical occupations make up a large share of all H1B positions.


Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Computer and mathematical occupations have accounted for a majority of total H1B cases since 2013. As there are more and more computer and mathematical occupations appearing from 2011 to 2016, the quantity of total H1B visa petitions also shows an upward trend.

It also makes sense that the wages of the majority will have an impact on the wages of all H1B applicants. H1B applicants who have computer or mathematical occupations are often provided with a higher salary, which might potentially drive up the wage level of all H1B applicants.

Were there any interesting or surprising interactions between features?

Computer and mathematical occupations are concentrated in metropolitan areas. More developed areas need a large number of workers with a technical background. As the biggest center of technology in the United States, the Bay Area (San Francisco, San Jose, Sunnyvale, Mountain View, Santa Clara) has the most H1B workers in the computer and mathematical fields.
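The Bay Area claim can be verified from the city table in the Multivariate Plots Section. The Python sketch below sums the five Bay Area cities named above and compares the total with New York City, the largest single city in that table. The counts are copied from the table; grouping exactly these five cities as "the Bay Area" follows the list above.

```python
# Computer/mathematical position counts, copied from the city table.
bay_area = {
    "SAN FRANCISCO": 34461,
    "SAN JOSE": 29342,
    "SUNNYVALE": 22837,
    "MOUNTAIN VIEW": 18393,
    "SANTA CLARA": 14636,
}
new_york_city = 63053

bay_area_total = sum(bay_area.values())
print(bay_area_total, new_york_city)
print(bay_area_total > new_york_city)
```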


Final Plots and Summary

Plot One

Description One

From the unimodal, right-skewed histograms of wage above, we can see that the peak on the left side becomes higher each year, and that most H1B workers earn wages between $55,000 and $70,000 in each year from 2011 to 2016.

The histograms also tell us that there are more H1B workers in 2015 and 2016 than in previous years, which is an indicator of US economic recovery.

Plot Two

Description Two

Generally, wages for computer and mathematical occupations are higher than those for other occupations. The median wage for H1B workers in computer or mathematical jobs goes up every year, with more and more people earning more than $120,000 annually. In recent years, computer technology has developed rapidly and big data has boomed, so it’s not surprising to see high demand for talent in the computer and mathematical fields.

Plot Three

Description Three

The San Francisco Bay Area, New York City, Seattle, Atlanta, Washington DC, Chicago, Dallas-Fort Worth and Houston are metropolitan areas with a large number of computer and mathematical positions. Correspondingly, the states where these metropolitan areas are located filed many more H1B petitions than other states.


Reflection

When I started out by examining the structure of the H1B dataset, I dealt with missing values and outliers and filtered out irrelevant records, but I missed some inconsistent data points with mixed-case characters, which caused a big problem later in the analysis. For example, the values of SOC_NAME in 2011 contain mixed-case letters, while those in 2016 are all uppercase, and this gave me a completely wrong picture at first. A raw dataset may not be as clean as we think; exploratory data analysis requires an open, inquisitive and skeptical mind and strong attention to detail.
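A small Python sketch shows why the mixed-case values matter and one way to normalize them before grouping. The function name normalize_soc and the sample values are hypothetical; collapsing repeated whitespace is an extra precaution added here, not something the original analysis is known to have done.

```python
from collections import Counter


def normalize_soc(name):
    """Uppercase an occupation name and collapse runs of whitespace."""
    return " ".join(name.split()).upper()


# Made-up variants of the same occupation, as they might appear across years.
raw = [
    "Software Developers, Applications",
    "SOFTWARE DEVELOPERS, APPLICATIONS",
    "software developers,  applications",
]

print(Counter(raw))                              # three separate groups
print(Counter(normalize_soc(n) for n in raw))    # one group after cleaning
```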

The only numerical variable directly available in this dataset is PREVAILING_WAGE, so I decided to put more effort into analyzing the wages of H1B workers across different groups. When I tried to draw a histogram of wages, I noticed that the huge amount of data, combined with extremely large values, slowed down the process. So I randomly sampled about one tenth of all the data points and filtered out the outliers. The final histogram closely approximates the distribution of wages for all H1B workers from 2011 to 2016.

The selection of appropriate visualization types should be based on the variables in the dataset. However, it is often necessary to create new variables in order to make more informative visualizations. By splitting the WORKSITE variable into STATE and CITY, I was able to draw a map displaying popular cities for H1B workers. By extracting computer and mathematical occupations from the dataset, I found that foreign workers with these technical skills have a higher chance of getting hired because of the increasing opportunities in the US job market.

One of the biggest limitations is that the dataset lacks the academic background of H1B workers. Some earned their university degrees in the US, while others earned them in their home countries. Some followed the career paths suggested by what they studied, while others made a career transition after graduation. This is important information that could help me conduct drill-down analysis of US employers’ preference for US universities or STEM majors.